Students: Cristhian Rodriguez and Jesus Perucha

Assignment 3: Titanic



In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt

We load the training and test data.



In [2]:

    
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

We inspect the data for null values that need to be filled in, such as Age and Cabin in this case.



In [3]:

    
print(train_df.columns.values)
train_df.isnull().sum()









    



['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']






    Out[3]:





PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Many Age and Cabin values are missing (177 and 687 of 891 rows), as well as 2 Embarked values.
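
To put such counts in proportion, it can help to look at the share of missing values per column. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the Titanic data); the same expression could be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull().mean() gives the share of missing values in each column
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```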

Next we check the data types.



In [4]:

    
print (train_df.info())
train_df.describe()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None






    Out[4]:






  
    
      
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Since 687 of the 891 Cabin values (about 77%) are missing and the remaining ones carry no useful information, we drop this feature.

We also drop Ticket, because the ticket identifiers show no relationship with one another.



In [5]:

    
train_df, test_df = train_df.drop(['Cabin', 'Ticket'], axis=1), test_df.drop(['Cabin', 'Ticket'], axis=1)



In [6]:

    
# Describe the columns that hold strings (object dtype)
train_df.describe(include=['O'])









    Out[6]:






  
    
      
                         Name   Sex Embarked
count                     891   891      889
unique                    891     2        3
top     Wheadon, Mr. Edward H  male        S
freq                        1   577      644

First-pass analysis of the data

Missing values must be filled in before the data can be used by the training algorithms.
All strings must be converted to numeric values before the data can be used by the training algorithms.
Useless features should be dropped, and new features can be derived from the existing ones for training.

Visualizing the relationships between features



In [7]:

    
plt.title('Survival count between sex', size=20, y=1.1)
sns.countplot(x = 'Survived', hue='Sex', data=train_df)
# There is a strong correlation between sex and survival









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae5a49f60>



In [8]:

    
# Convert Sex from string to int: 1 for male, 0 for female
for df in [train_df, test_df]:
    df['Sex'] = df['Sex'].apply(lambda x : 1 if x == 'male' else 0)



In [9]:

    
# There is a direct relationship between passenger class and survival
plt.figure(figsize=(12, 12))
plt.subplot(2,2,1)
plt.title('Survival rate / Pclass', size=15, y=1.1)
sns.barplot(x='Pclass', y = 'Survived', data=train_df, palette='muted')









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae5b59780>



In [10]:

    
sns.countplot(x = 'Survived', hue='Embarked', data=train_df)
# There is also a slight correlation with the port of embarkation









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae5bd22e8>

The port of embarkation is missing for 2 passengers, and since we will use this feature we fill those values with S: it is where most passengers boarded (644 of 889), so it carries the least risk of distorting the feature. We also map S, C, Q to integer values for training.



In [11]:

    
train_df['Embarked'] = train_df['Embarked'].fillna('S')
for dt in [train_df, test_df]:
    dt['Embarked'] = dt['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
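
Rather than hard-coding 'S', the most frequent port can be read from the data itself. A minimal sketch on a made-up column (`ports` is a hypothetical stand-in for `train_df['Embarked']`):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of embarkation ports with two gaps
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

most_common = ports.mode()[0]          # mode() ignores NaN by default
filled = ports.fillna(most_common)

# Same integer encoding as in the cell above
encoded = filled.map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(most_common, encoded.tolist())
```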



In [12]:

    
# Fill in the single missing Fare value (in the test set) with the median
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())



In [13]:

    
# Discretize the continuous Fare values into 4 quantile groups, labelled 0 to 3
for df in [train_df, test_df]:
    df['Fare'] = pd.qcut(df['Fare'], 4, labels=[0, 1, 2, 3])

train_df.head(5)









    Out[13]:






  
    
      
   PassengerId  Survived  Pclass                                               Name  Sex   Age  SibSp  Parch  Fare  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    1  22.0      1      0     0         0
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...    0  38.0      1      0     3         1
2            3         1       3                             Heikkinen, Miss. Laina    0  26.0      0      0     1         0
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0  35.0      1      0     3         0
4            5         0       3                           Allen, Mr. William Henry    1  35.0      0      0     1         0
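
Calling `pd.qcut` separately on each frame derives the quartile edges from that frame alone, so the same fare may end up in different groups in the training and test sets. One possible alternative, sketched here on made-up numbers (`train_fare` and `test_fare` are hypothetical), is to learn the edges on the training data and reuse them:

```python
import numpy as np
import pandas as pd

train_fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 512.3])
test_fare = pd.Series([7.0, 14.5, 30.0, 600.0])

# Learn the quartile edges on the training data only
_, edges = pd.qcut(train_fare, 4, retbins=True)

# Open the outer edges so unseen test values still land in a bin
edges[0], edges[-1] = -np.inf, np.inf

train_bins = pd.cut(train_fare, bins=edges, labels=[0, 1, 2, 3]).astype(int)
test_bins = pd.cut(test_fare, bins=edges, labels=[0, 1, 2, 3]).astype(int)
print(test_bins.tolist())
```

The trade-off is that the test groups are no longer guaranteed to be equally populated, but a given group label then means the same fare range in both frames.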

Parch counts the parents and children a passenger has aboard and SibSp counts the siblings and spouses, so these 2 features can be combined into a single one representing the size of the passenger's family, including the passenger. We plot it to see how it relates to survival.



In [14]:

    
for df in [train_df, test_df]:
    df['FamilySize'] = df['Parch'] + df['SibSp'] + 1



In [15]:

    
sns.barplot(x='FamilySize', y='Survived' , data=train_df)









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae6dae940>

The plot shows that passengers with a family size of 2, 3 or 4 had a higher chance of survival. We therefore simplify the new feature to 1 if the family has 2, 3 or 4 members aboard and 0 otherwise. After this, the Parch and SibSp features are no longer needed.



In [16]:

    
def filter_family_size(x):
    if x == 1:
        return 0
    elif x < 5:
        return 1
    else:
        return 0

for df in [train_df, test_df]:
    df['FamilySize'] = df['FamilySize'].apply(filter_family_size)
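
The same rule can be written as a single vectorized expression computed directly from the raw counts. Because it never reads back its own output, it gives the same result however many times it is run, whereas re-applying `filter_family_size` to an already binarized column would swap the 0s and 1s. A sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical passengers: alone, couple, family of 4, family of 7, parent and child
toy = pd.DataFrame({"SibSp": [0, 1, 1, 4, 0], "Parch": [0, 0, 2, 2, 1]})

family_size = toy["SibSp"] + toy["Parch"] + 1      # passenger included
# 1 for families of 2, 3 or 4 aboard, 0 otherwise
toy["FamilySize"] = family_size.between(2, 4).astype(int)
print(toy["FamilySize"].tolist())
```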




In [17]:

    
train_df = train_df.drop(['Parch', 'SibSp'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp'], axis=1)

Filling in Age

The most accurate way to do this is to use the median together with the correlations Age has with other features; here the most correlated ones are Sex and Pclass, as the heatmap below shows.

From Age we then create a new feature with age ranges, to compare survival across ranges.



In [18]:

    
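# Correlate only the numeric columns (Name is text and Fare is now categorical)
corrmat = train_df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corrmat, square=True)
print("Number of missing Age values:", train_df['Age'].isnull().sum())


Number of missing Age values: 177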



In [19]:

    
plt.title('Original age distribution', size=20, y=1.1)
sns.distplot(train_df['Age'].dropna())









    



F:\Archivos de Programa\Anaconda3_4.3.1\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae6d97390>



In [20]:

    
# Fill in the missing Age values
guess_ages = np.zeros((2,3))
for dataset in [train_df, test_df]:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Round the median age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

print("Number of missing Age values:", train_df['Age'].isnull().sum())


Number of missing Age values: 0
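
The nested loops above amount to a group-wise median fill. As a compact sketch of the same idea (leaving out the rounding to the nearest 0.5 and the cast to int), `groupby(...).transform('median')` can do it in one step; the frame `toy` below is hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex":    [1, 1, 1, 0, 0, 0],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned with the original rows
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)
print(toy["Age"].tolist())
```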

Because the new values were filled in with the median of each group, the distribution keeps the same overall shape as before, but with a spike of values around the median.



In [21]:

    
plt.title('Age distribution after filling', size=20, y=1.1)
sns.distplot(train_df['Age'])









    



F:\Archivos de Programa\Anaconda3_4.3.1\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae70f2860>



In [22]:

    
# Create the new feature and display it
train_df['AgeBand'] = pd.cut(train_df['Age'], 8)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
sns.countplot(x='Survived', hue='AgeBand' , data=train_df)









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x20ae720a5c0>

We convert Age into values from 0 to 7 following the age bands created above. After this change, AgeBand is a feature we no longer need.



In [23]:

    
for dataset in [train_df, test_df]:    
    dataset.loc[ dataset['Age'] <= 10, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 10) & (dataset['Age'] <= 20), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 30), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 30) & (dataset['Age'] <= 40), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 50), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 50) & (dataset['Age'] <= 60), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 60) & (dataset['Age'] <= 70), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 70, 'Age'] = 7
train_df.head()









    Out[23]:






  
    
      
   PassengerId  Survived  Pclass                                               Name  Sex  Age  Fare  Embarked  FamilySize   AgeBand
0            1         0       3                            Braund, Mr. Owen Harris    1    2     0         0           1  (20, 30]
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...    0    3     3         1           1  (30, 40]
2            3         1       3                             Heikkinen, Miss. Laina    0    2     1         0           1  (20, 30]
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)    0    3     3         0           1  (30, 40]
4            5         0       3                           Allen, Mr. William Henry    1    3     1         0           1  (30, 40]
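
The eight assignments used to encode Age can also be expressed as one `pd.cut` call over explicit edges. This is only a sketch of an equivalent formulation, shown on a hypothetical series `ages`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 10, 11, 22, 35, 58, 70, 80])

# Right-closed bins: (-inf, 10], (10, 20], ..., (60, 70], (70, inf)
edges = [-np.inf, 10, 20, 30, 40, 50, 60, 70, np.inf]
age_code = pd.cut(ages, bins=edges, labels=False)   # integer codes 0..7
print(age_code.tolist())
```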



In [25]:

    
train_df = train_df.drop(['AgeBand'], axis=1)

We classify each name according to the person's title



In [24]:

    
# Filter the name
def get_title(x):
    y = x[x.find(',')+1:].replace('.', '').replace(',', '').strip().split(' ')
    if y[0] == 'the':    # Search for the countess
        title = y[1]
    else:
        title = y[0]
    return title

def filter_title(title, sex):
    if title in ['Countess', 'Dona', 'Lady', 'Jonkheer', 'Mme', 'Mlle', 'Ms', 'Capt', 'Col', 'Don', 'Sir', 'Major', 'Rev', 'Dr']:
        if sex:
            return 'Rare_male'
        else:
            return 'Rare_female'
    else:
        return title

for df in [train_df, test_df]:
    df['NameLength'] = df['Name'].apply(lambda x : len(x))
    df['Title'] = df['Name'].apply(get_title)



In [26]:

    
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in [train_df, test_df]:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)



In [27]:

    
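# Group the special titles into broader categories.
# Title was already mapped to numbers in the previous cell, so filter_title matches
# none of its title strings here and the unmapped (rare) titles stay at 0.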
for df in [train_df, test_df]:
    df['Title'] = df.apply(lambda x: filter_title(x['Title'], x['Sex']), axis=1)

sns.countplot(y=train_df['Title'])
train_df.groupby('Title')['PassengerId'].count().sort_values(ascending=False)









    Out[27]:





Title
1.0    517
2.0    182
3.0    125
4.0     40
0.0     27
Name: PassengerId, dtype: int64
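
Since `filter_title` compares against title strings, it only has an effect while `Title` still holds text. The sketch below shows one possible ordering, grouping first and mapping to integers afterwards. The helper `group_title`, the codes 5 and 6 for the rare groups, and the small frame `toy` are assumptions made for illustration, not values from the notebook.

```python
import pandas as pd

RARE = {'Countess', 'Dona', 'Lady', 'Jonkheer', 'Mme', 'Mlle', 'Ms',
        'Capt', 'Col', 'Don', 'Sir', 'Major', 'Rev', 'Dr'}

def get_title(name):
    # Text between the comma and the first period, e.g. "Mr" or "the Countess"
    title = name.split(',')[1].split('.')[0].strip()
    return title.split(' ')[-1]          # "the Countess" -> "Countess"

def group_title(title, sex):
    # sex: 1 for male, 0 for female, as encoded earlier
    if title in RARE:
        return 'Rare_male' if sex else 'Rare_female'
    return title

toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
             'Byles, Rev. Thomas Roussel Davids'],
    'Sex': [1, 0, 0, 1],
})

titles = toy['Name'].apply(get_title)
grouped = [group_title(t, s) for t, s in zip(titles, toy['Sex'])]

# Hypothetical integer codes; anything unlisted falls back to 0
codes = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare_male': 5, 'Rare_female': 6}
toy['Title'] = [codes.get(g, 0) for g in grouped]
print(toy['Title'].tolist())
```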



In [28]:

    
# Drop the Name column (and PassengerId from the training set)
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)



In [29]:

    
train_df.head()

Model selection



In [30]:

    
X_train = train_df.drop(["Survived"], axis=1).copy()
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape









    Out[30]:





((891, 8), (891,), (418, 8))



In [ ]:

    
X_test.head()



In [ ]:

    
X_train.head()

Random Forest



In [31]:

    
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=101)
random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

#acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
#acc_random_forest









    Out[31]:





0.95622895622895621

Decision Tree



In [37]:

    
from sklearn import tree
clf = tree.DecisionTreeClassifier()

clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
clf.score(X_train, Y_train)









    Out[37]:





0.95622895622895621

Support Vector Machines



In [35]:

    
from sklearn.svm import SVC
svc = SVC(C=10000.0)
svc.fit(X_train, Y_train)

Y_pred = svc.predict(X_test)

svc.score(X_train, Y_train)









    Out[35]:





0.95622895622895621

KNN



In [34]:

    
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)

knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)

knn.score(X_train, Y_train)









    Out[34]:





0.8709315375982043
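
All of the scores above are measured on the same rows the models were fitted on, so they mostly reflect how well each model memorizes the training set. A common way to compare models more fairly is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `Y_train`; the fold count and the random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=101, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

cv_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 held-out folds
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(cv_scores[name], 3))
```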

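We create the submission file to upload to Kaggle. Note that Y_pred holds the predictions of whichever model cell ran last; going by the execution counts, that is the decision tree (In [37]).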



In [38]:

    
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })

We save it in CSV format



In [39]:

    
submission.to_csv('submission.csv', index=False)

Output of train_df.head() after preprocessing (In [29]):

   Survived  Pclass  Sex  Age  Fare  Embarked  FamilySize  NameLength  Title
0         0       3    1    2     0         0           1          23    1.0
1         1       1    0    3     3         1           1          51    3.0
2         1       3    0    2     1         0           1          22    2.0
3         1       1    0    3     3         0           1          44    3.0
4         0       3    1    3     1         0           1          24    1.0
